The folder RiskyBehavior contains data from a randomized trial targeting couples at high risk of HIV infection. The intervention provided counseling sessions regarding practices that could reduce their likelihood of contracting HIV. Couples were randomized either to a control group, a group in which just the woman participated, or a group in which both members of the couple participated. One of the outcomes examined after three months was “number of unprotected sex acts.”
Model this outcome as a function of treatment assignment using a Poisson regression. Does the model fit well? Is there evidence of overdispersion?
risk <- read.csv("https://raw.githubusercontent.com/avehtari/ROS-Examples/master/RiskyBehavior/data/risky.csv",header=T)
risk$fupacts_R = round(risk$fupacts)
fit1 <- stan_glm(fupacts_R ~ couples + women_alone, family = poisson(link = "log"), data = risk, refresh = 0)
pp_check(fit1)
predicted <- predict(fit1)
residuals <- resid(fit1)
plot(predicted, residuals, xlab="predicted value", ylab="residual",
main="Residuals vs. predicted values", pch=20)
abline(0, 0, col="gray", lwd=.5)
#rootogram(fit1)
## Yes, there is evidence of overdispersion. The residuals are huge, and the posterior predictive check shows that the model is not fitting the data well.
To summarize:
sex is the sex of the person, recorded as “man” or “woman” herecouples is an indicator for if the couple was counseled togetherwomen_alone is an indicator for if the woman went to counseling by herselfbs_hiv indicates if the individual is HIV positivebupacts is the number of unprotected sex acts reported as a baseline (before treamtnet)fupacts is the number of unprotected sex acts reported at the end of the studyNext extend the model to include pre-treatment measures of the outcome and the additional pre-treatment variables included in the dataset. Does the model fit well? Is there evidence of overdispersion?
fit2 <- stan_glm(fupacts_R ~ couples + women_alone + bs_hiv + log(bupacts + 1) + sex, family = poisson(link = "log"), data = risk, refresh = 0)
fit2
## stan_glm
## family: poisson [log]
## formula: fupacts_R ~ couples + women_alone + bs_hiv + log(bupacts + 1) +
## sex
## observations: 434
## predictors: 6
## ------
## Median MAD_SD
## (Intercept) 1.0 0.0
## couples -0.3 0.0
## women_alone -0.5 0.0
## bs_hivpositive -0.4 0.0
## log(bupacts + 1) 0.7 0.0
## sexwoman 0.1 0.0
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
pp_check(fit2)
predicted <- predict(fit2)
residuals <- resid(fit2)
plot(predicted, residuals, xlab="predicted value", ylab="residual",
main="Residuals vs. predicted values", pch=20)
abline(0, 0, col="gray", lwd=.5)
# The model does not fit well (based on pp_check plot) and there is evidence of overdispersion (based on residual plot being very spread out)
Fit a negative binomial (overdispersed Poisson) model. What do you conclude regarding effectiveness of the intervention?
fit3 <- glm.nb(fupacts_R ~ couples + women_alone + bs_hiv + log(bupacts + 1) + sex, data=risk, link="log")
fit3
##
## Call: glm.nb(formula = fupacts_R ~ couples + women_alone + bs_hiv +
## log(bupacts + 1) + sex, data = risk, link = "log", init.theta = 0.4357586657)
##
## Coefficients:
## (Intercept) couples women_alone bs_hivpositive
## 1.31804 -0.36679 -0.64007 -0.51314
## log(bupacts + 1) sexwoman
## 0.61832 -0.05974
##
## Degrees of Freedom: 433 Total (i.e. Null); 428 Residual
## Null Deviance: 603.1
## Residual Deviance: 488 AIC: 2953
#pp_check(fit3)
ggplot()+
geom_point(aes(x=predict(fit3, type="response"), y=resid(fit3)))+
labs(x="predicted value", y="residual", title = "Residuals vs. predicted values")+
geom_abline(slope=0, intercept=0, color="gray")
## You can conclude that the intervention is effective, but it appears to be more effective for women alone than for couples. This is shown by the women_alone coefficient having a larger absolute value than the couples. However, both of the coefficients are negative, suggesting that either intervention reduces the number of unprotected sex acts.
These data include responses from both men and women from the participating couples. Does this give you any concern with regard to our modeling assumptions?
# This could be an issue because if both people in the couple are responding, this means that we are getting double the amount of information for these couples, thus doubling the response variable for these observations. This also means that not all the data points are independent of one another, but there is some collinearity going on.
Redo the basketball shooting example on page 270, making some changes:
Instead of having each player shoot 20 times, let the number of shots per player vary, drawn from the uniform distribution between 10 and 30.
set.seed(100)
N <- 100
height <- rnorm(N, 72, 3)
p <- 0.4 + 0.1*(height - 72)/3
n <- round(runif(N, 10, 30), digits = 0)
for (i in 1:n) {
y <- rbinom(N, i, p)
}
## Warning in 1:n: numerical expression has 100 elements: only the first used
data <- data.frame(n=n, y=y, height=height)
#fit1 <- stan_glm(cbind(y, n-y) ~ height, family = binomial(link = "logit"), data = data, refresh = 0)
#fit1
Instead of having the true probability of success be linear, have the true probability be a logistic function, set so that Pr(success) = 0.3 for a player who is 5’9" and 0.4 for a 6’ tall player.
N <- 100
height <- rnorm(N, 72, 3)
p <- invlogit(-.405 + .441*((height - 72)/3))
n <- round(runif(N, 10, 30), digits = 0)
for (i in 1:n) {
y <- rbinom(N, i, p)
}
## Warning in 1:n: numerical expression has 100 elements: only the first used
data <- data.frame(n=n, y=y, height=height)
round(data$y, digits = 0)
## [1] 10 12 10 13 12 12 5 12 6 10 7 3 7 9 5 5 11 11 7 6 6 6 10 7 8
## [26] 7 9 8 13 6 12 4 8 10 4 9 13 3 10 11 9 8 10 7 8 14 5 11 12 8
## [51] 6 10 10 5 10 8 6 15 9 5 9 11 11 1 13 8 11 3 13 10 13 6 7 16 10
## [76] 5 10 10 16 14 11 13 14 11 6 2 11 8 14 5 12 6 3 11 7 6 5 14 7 11
#fit2 <- stan_glm(cbind(y, n-y) ~ height, family = binomial(link = "logit"), data = data, refresh = 0)
#fit2
Experimental data from the National Supported Work example are in the folder Lalonde. Use the treatment indicator and pre-treatment variables to predict post-treatment (1978) earnings using a Tobit model. Interpret the model coefficients.
lalonde = foreign::read.dta("https://github.com/avehtari/ROS-Examples/blob/master/Lalonde/NSW_dw_obs.dta?raw=true")
fit <- vglm(re78 ~ treat + age + educ + black + hisp + married + nodegree + sample + educ_cat4, tobit(), data = lalonde)
summary(fit)
##
## Call:
## vglm(formula = re78 ~ treat + age + educ + black + hisp + married +
## nodegree + sample + educ_cat4, family = tobit(), data = lalonde)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## mu -2.6786 -0.5623 0.2785 0.76573 7.689
## loglink(sd) -0.9911 -0.6686 -0.3877 0.06843 61.550
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -9.172e+03 8.814e+02 -10.406 < 2e-16 ***
## (Intercept):2 9.343e+00 5.635e-03 1658.059 < 2e-16 ***
## treat 3.693e+03 9.781e+02 3.776 0.000159 ***
## age 5.513e+01 8.662e+00 6.365 1.96e-10 ***
## educ 3.574e+02 6.952e+01 5.142 2.73e-07 ***
## black -3.211e+03 2.979e+02 -10.779 < 2e-16 ***
## hisp -6.430e+02 3.504e+02 -1.835 0.066470 .
## married 4.759e+03 2.137e+02 22.264 < 2e-16 ***
## nodegree -1.005e+03 2.864e+02 -3.510 0.000448 ***
## sample 6.588e+03 2.551e+02 25.819 < 2e-16 ***
## educ_cat4 5.345e+02 1.994e+02 2.681 0.007349 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: mu, loglink(sd)
##
## Log-likelihood: -176905.5 on 37323 degrees of freedom
##
## Number of Fisher scoring iterations: 5
##
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):2'
## By looking at the signs of the coefficients, we can see that being black, hispanic, and having no degree all degrease expected post-treatment earnings. All other variables increase average expected earnings as the level of the predictor increases. For example, as a person's education level increases, their average expected earings increases as well. In addition, the only variable that is not statistically significant is hisp, although the p-value is 0.06, so it is barely above the threshold for being considered "significant", so it is hard to say whether we are seeing a true effect or not.
The folder Congress has the votes for the Democratic and Republican candidates in each U.S. congressional district in 1988, along with the parties’ vote proportions in 1986 and an indicator for whether the incumbent was running for reelection in 1988. For your analysis, just use the elections that were contested by both parties in both years.
congress = read.csv("https://raw.githubusercontent.com/avehtari/ROS-Examples/master/Congress/data/congress.csv")
congress88 <- data.frame(vote=congress$v88_adj,pastvote=congress$v86_adj,inc=congress$inc88)
Fit a linear regression using stan_glm with the usual normal-distribution model for the errors predicting 1988 Democratic vote share from the other variables and assess model fit.
fit1 <- stan_glm(vote ~ pastvote + inc, data = congress88, refresh = 0)
fit1
## stan_glm
## family: gaussian [identity]
## formula: vote ~ pastvote + inc
## observations: 435
## predictors: 3
## ------
## Median MAD_SD
## (Intercept) 0.2 0.0
## pastvote 0.5 0.0
## inc 0.1 0.0
##
## Auxiliary parameter(s):
## Median MAD_SD
## sigma 0.1 0.0
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
pp_check(fit1)
predicted <- predict(fit1)
residuals <- resid(fit1)
plot(predicted, residuals, xlab="predicted value", ylab="residual",
main="Residuals vs. predicted values", pch=20)
abline(0, 0, col="gray", lwd=.5)
## based on the pp_check plot and the residuals, this fit looks to be not great, but also not terrible. The pp_check plot shows that the model has the same bimodal shape as the simulations, but it seems a bit off. The residuals also show this bimodal pattern, but the residuals don't appear to be overdispersed.
Fit the same sort of model using the brms package with a t distribution, using the brm function with the student family. Again assess model fit.
fit2 <- brm(vote ~ pastvote + inc, data = congress88, family = student, refresh = 0)
## Compiling Stan program...
## Trying to compile a simple C file
## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## clang -mmacosx-version-min=10.13 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/unsupported" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/src/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppParallel/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/rstan/include" -DEIGEN_NO_DEBUG -DBOOST_DISABLE_ASSERTS -DBOOST_PENDING_INTEGER_LOG2_HPP -DSTAN_THREADS -DBOOST_NO_AUTO_PTR -include '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp' -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1 -I/usr/local/include -fPIC -Wall -g -O2 -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:16: error: expected ';' after top level declarator
## namespace Eigen {
## ^
## ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
## ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1
## Start sampling
summary(fit2)
## Family: student
## Links: mu = identity; sigma = identity; nu = identity
## Formula: vote ~ pastvote + inc
## Data: congress88 (Number of observations: 435)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 0.22 0.02 0.19 0.26 1.00 2012 2366
## pastvote 0.55 0.04 0.48 0.62 1.00 1961 2218
## inc 0.09 0.01 0.08 0.11 1.00 2084 2189
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 0.05 0.00 0.05 0.06 1.00 1869 2290
## nu 6.24 2.49 3.37 12.20 1.00 1901 2058
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
pp_check(fit2)
## Using 10 posterior samples for ppc type 'dens_overlay' by default.
predicted <- predict(fit2)
residuals <- resid(fit2)
plot(predicted, residuals, xlab="predicted value", ylab="residual",
main="Residuals vs. predicted values", pch=20)
abline(0, 0, col="gray", lwd=.5)
## Although the pp_check plot still isn't a perfect fit, the residuals no longer show a clear bimodal pattern, so this fit seems to be a bit better than the previous model.
Which model do you prefer?
Use the same data as the previous example with the goal instead of predicting for each district whether it was won by the Democratic or Republican candidate.
Fit a standard logistic or probit regression and assess model fit.
congress88$winner <- ifelse(congress88$vote>0.5,1,0)
fit1 <- glm(winner ~ pastvote + inc, family = binomial(link = "logit"), data = congress88)
fit1
##
## Call: glm(formula = winner ~ pastvote + inc, family = binomial(link = "logit"),
## data = congress88)
##
## Coefficients:
## (Intercept) pastvote inc
## -5.563 11.283 2.694
##
## Degrees of Freedom: 434 Total (i.e. Null); 432 Residual
## Null Deviance: 587.1
## Residual Deviance: 75.39 AIC: 81.39
# fit1 <- stan_glm(winner ~ pastvote + inc, family = binomial(link = "logit"), data = congress88)
# fit1
#pp_check(fit1)
binnedplot(fitted(fit1),resid(fit1))
# the residuals are a bit wacky but the pp_check plot appears to be fitting very well.
Fit a robit regression and assess model fit.
fit2 <- glm(winner ~ pastvote + inc, family = binomial(link = gosset(2)), data = congress88)
fit2
##
## Call: glm(formula = winner ~ pastvote + inc, family = binomial(link = gosset(2)),
## data = congress88)
##
## Coefficients:
## (Intercept) pastvote inc
## -5.668 11.594 3.715
##
## Degrees of Freedom: 434 Total (i.e. Null); 432 Residual
## Null Deviance: 587.1
## Residual Deviance: 78.96 AIC: 84.96
binnedplot(fitted(fit2),resid(fit2))
Which model do you prefer?
The folder RiskyBehavior contains data from a study of behavior of couples at risk for HIV; see Exercise 15.1.
Fit a Poisson regression predicting number of unprotected sex acts from baseline HIV status. Perform predictive simulation to generate 1000 datasets and record the percentage of observations that are equal to 0 and the percentage that are greater than 10 (the third quartile in the observed data) for each. Compare these to the observed value in the original data.
risky <- read.csv("https://raw.githubusercontent.com/avehtari/ROS-Examples/master/RiskyBehavior/data/risky.csv", header = TRUE)
Repeat (a) using a negative binomial (overdispersed Poisson) regression.
###(c) Repeat (b), also including ethnicity and baseline number of unprotected sex acts as inputs.
Exercise 15.7 used a Tobit model to fit a regression with an outcome that had mixed discrete and continuous data. In this exercise you will revisit these data and build a two-step model: (1) logistic regression for zero earnings versus positive earnings, and (2) linear regression for level of earnings given earnings are positive.
Compare predictions that result from each of these models with each other.